Introduction to statistical models
R and packages
- “Base” R
- Dates back to 1990s
- Now at version 4.2.2
- Evolves a bit, but nothing dramatic
- Packages
- Add-ons for extra functionality
- Anyone can write: much more dynamic
- (If possible, best to keep RStudio, R and packages up to date.)
“Tidy” data in R
- “Tidy” has specific meaning, i.e. that data are in table form with
- row = an observation (“object”)
- column = a variable (“measurement type”)
- each cell containing one value
| 5224360 | S01008082 | False Alarm (UFAS) | 2 | 2014-09-11 |
| 7478338 | S01009650 | Other Primary Fire | 3 | 2017-12-30 |
| 5716405 | S01012100 | False Alarm (Dwelling) | 2 | 2015-06-24 |
| 7832270 | S01009860 | Dwelling Fire | 5 | 2018-07-02 |
| 5921500 | S01011244 | Outdoor Fire | 2 | 2015-10-15 |
- Strong connections to SQL
- Terminology: “Table”, “Data Frame”, “Tibble” are mostly interchangeable.
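As a rough illustration of the tidy layout described above, the sketch below builds a small tibble by hand. The values are made up, and the column names are borrowed from later slides purely for illustration.

```r
library(tibble)

# Made-up values, for illustration only: one row per incident,
# one column per variable, one value per cell.
toy_incidents <- tibble(
  DataZone    = c("Z001", "Z002", "Z001"),
  Risk_Level  = c(2, 5, 3),
  DateCreated = as.Date(c("2015-01-10", "2016-03-02", "2017-11-30"))
)

toy_incidents                 # prints as a tibble
is.data.frame(toy_incidents)  # a tibble is also a data frame
```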
Tidyverse packages - dplyr
- Functions to manipulate data table(s):
- select to pick columns
- filter to pick rows
- group_by and tally to summarise
- mutate to add new columns
- left_join, full_join, etc., to join different tables
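A minimal sketch of these functions on two made-up tables may help. The tables toy_incidents and toy_zones, and the column high_risk, are hypothetical and exist only for this example.

```r
library(dplyr)

# Made-up tables, for illustration only.
toy_incidents <- tibble(
  DataZone   = c("A", "A", "B", "C"),
  Risk_Level = c(2, 5, 3, 5)
)
toy_zones <- tibble(
  DataZone     = c("A", "B"),
  n_properties = c(120, 80)
)

# select: keep only the DataZone column
zones_only <- select(toy_incidents, DataZone)

# filter: keep only the rows with risk level 5
level_5 <- filter(toy_incidents, Risk_Level == 5)

# mutate: add a new logical column
flagged <- mutate(toy_incidents, high_risk = Risk_Level >= 4)

# group_by + tally: count rows per data zone
per_zone <- tally(group_by(toy_incidents, DataZone))

# left_join: keep every incident; zone "C" has no match, so it gets NA
joined <- left_join(toy_incidents, toy_zones, by = "DataZone")
```

Printing each result in turn (for example by typing joined at the console) shows what each function returned.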
Packages - installing and loading
- If a package isn’t already installed we can install with:
install.packages("dplyr")
- After it’s installed we can load it with the library command:
library(dplyr)
- This puts all the package’s functions, data sets, etc., into the “environment”.
- We can use library(tidyverse) to load all the tidyverse packages in one go (including dplyr)
- In RStudio, Tools -> Check for Package Updates, then Select All and Install Updates, keeps packages up to date.
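One common pattern, sketched here as an optional convenience rather than something these slides require, is to install a package only when it is missing and then load it:

```r
# Install dplyr only if it is not already installed, then load it.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

# Report which version of dplyr is installed.
packageVersion("dplyr")
```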
Getting started for today
- First we’ll do Session -> Set Working Directory to select a suitable folder (where we’ve put incidents.rds).
- Then we’ll load the incidents data file:
incidents <- readr::read_rds('incidents.rds')
- (The :: tells R to look for read_rds in the readr package.)
- Then glimpse lets us see what we’ve loaded:
glimpse(incidents)
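For anyone without incidents.rds to hand, the same load-and-inspect steps can be tried on a made-up table. This sketch writes a toy table to a temporary .rds file, reads it back, and inspects it. The table toy and its columns are hypothetical.

```r
library(readr)
library(dplyr)

# Made-up table standing in for the real incidents data.
toy <- tibble(DataZone = c("A", "B"), Risk_Level = c(2, 5))

# Write to a temporary .rds file, then read it back in.
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
toy_back <- read_rds(path)

# glimpse shows one line per column: name, type and first few values.
glimpse(toy_back)
```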
dplyr: filtering and selecting
filter(incidents, DataZone == "S01012100")
select(incidents, DateCreated)
select(filter(incidents, DataZone == "S01012100"), DateCreated)
(Tip: in RStudio pressing the TAB key often helpfully autocompletes function and variable names)
These commands are “standard” code. They work fine but can soon become hard to read if we make multiple function calls. This is where “piping” helps.
The pipe, |>
- That previous command read out loud is: “Take incidents data then filter (down to a particular DataZone) then select the DateCreated variable.”
incidents |> filter(DataZone == "S01012100") |> select(DateCreated)
- The pipe, |>, is read as “then”
- Can string together “pipelines” of commands that remain readable
- Pipelines are easy to “break up” when developing/checking code, e.g. copy-and-pasting the first bit(s) of it
- Note: |> has been part of “base R” since version 4.1.0. The older %>% pipe comes from the magrittr package (loaded with the tidyverse) and still works.
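To check that the piped and nested forms really are the same thing, the sketch below runs both on a made-up table and compares the results. It assumes R 4.1.0 or later (for |>), and toy_incidents is hypothetical.

```r
library(dplyr)

# Made-up table, for illustration only.
toy_incidents <- tibble(
  DataZone    = c("A", "B", "A"),
  DateCreated = as.Date(c("2015-06-24", "2016-01-01", "2018-07-02"))
)

# "Standard" nested calls: read from the inside out.
nested <- select(filter(toy_incidents, DataZone == "A"), DateCreated)

# The same steps as a pipeline: read left to right.
piped <- toy_incidents |> filter(DataZone == "A") |> select(DateCreated)

identical(nested, piped)  # both forms give the same result
```

R rewrites x |> f(y) as f(x, y) before running it, which is why the two results match exactly.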
dplyr: grouping and summarising
- group_by works in conjunction with another function that performs an operation “by group”.
- In our reports, we often use group_by with tally:
incidents |> group_by(DateCreated) |> tally() |> plot()
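A small sketch on made-up data shows what group_by with tally returns. The table toy is hypothetical. dplyr's count() is shown as a shorthand for the same operation.

```r
library(dplyr)

# Made-up table, for illustration only.
toy <- tibble(DataZone = c("A", "B", "A", "C", "A"))

# One row per group, with the group size in a column called n.
by_zone <- toy |> group_by(DataZone) |> tally()
by_zone

# count() combines the group_by and tally steps.
by_zone_short <- toy |> count(DataZone)
```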
A note on workflow
- In our workflow to date we have:
- worked with “data dumps” exported from SQL databases as CSV files
- loaded these into local computer memory
- performed calculations on local CPU
- Issues with this include:
- Data security
- Data integrity/versioning
- Storage/compute limitations
- Much better (if possible) is to connect directly to SQL databases
- dbplyr enables this: it translates dplyr code into SQL, so we can write the same code as if the data were local
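The sketch below gives a feel for this, using an in-memory SQLite database as a stand-in for a real SQL server. It assumes the DBI, RSQLite and dbplyr packages are installed. The table contents are made up, and a real connection would need different dbConnect arguments.

```r
library(dplyr)
library(DBI)

# In-memory SQLite database standing in for a real SQL server (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "incidents", data.frame(
  DataZone   = c("A", "A", "B"),
  Risk_Level = c(2, 5, 3)
))

# A lazy reference to the database table: no rows are pulled into R yet.
incidents_db <- tbl(con, "incidents")

# Ordinary dplyr code; dbplyr translates it to SQL behind the scenes.
query <- incidents_db |>
  group_by(DataZone) |>
  tally() |>
  arrange(DataZone)

show_query(query)         # prints the SQL that will be sent to the database
result <- collect(query)  # runs the query and brings the result into R

dbDisconnect(con)
```

The counting is done by the database, and only the small summary table travels back to the local machine.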
Some examples from our reports
What do the following do?
(NB: acorn is a table with each row being a property)
acorn_tallied <- acorn |> group_by(DataZone) |> tally(name = 'n_properties')
incidents_tallied <- incidents |>
  group_by(DataZone, ACORN_CAT, Risk_Level) |>
  tally(name = 'n_incidents')
Answers:
- acorn_tallied counts the number of properties in each data zone
- incidents_tallied counts how many incidents there were in each (DataZone, ACORN_CAT, Risk_Level) combination
Some things to try
- Determine how many incidents there were of each risk level
- Consider all the incidents in data zone S01012315
- Filter to obtain these.
- How many were there?
- Of these, how many were of type “Dwelling Fire”?
- Of all the incidents with risk level 5, find the number of fire casualties involved.
- Pipe the output into table() to make a frequency table.
- Pipe the resulting table into barplot() to show the data as a plot.
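As a hint for the last two steps, here is the table() and barplot() pattern on a made-up column. The table toy and the column IncidentType are hypothetical, so the real exercise will need the appropriate column from incidents.

```r
library(dplyr)

# Made-up table, for illustration only.
toy <- tibble(
  IncidentType = c("Dwelling Fire", "Outdoor Fire", "Dwelling Fire")
)

# pull() extracts one column as a plain vector; table() then counts each value.
freq <- toy |> pull(IncidentType) |> table()
freq

# barplot() draws one bar per value, with height equal to its count.
freq |> barplot()
```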
